Using Syllables as Features in Morpheme Tagging in Swahili
Abstract
Corpora have been used in many different ways to build morphological analyzers for computational applications. Methods for automated morphological analysis generally focus on segmenting raw text and ignore the task of learning which morpheme features are present. Other methods, such as constructing a grammar by hand using finite-state transducers, are time-consuming and require a great deal of prior knowledge of the language. We seek to create an analyzer that identifies which features are present without explicit segmentation and analysis. In this paper, we propose using surface-level cues of the morpheme, namely the character sequences that generally comprise it, as a guide for statistical morphological tag assignment in Bantu languages. In Swahili, these surface-level cues are syllabic in nature. This is grounded in typological insights from Bantu phonology: morphemes are generally monosyllabic and consist of open syllables. Furthermore, this insight from Bantu is mirrored in current phonological theory. Optimality Theory (OT) (Prince and Smolensky, 2004) is a constraint-based approach to deriving a phonetic output from its phonological input, and vice versa. The phenomenon discussed above relates very closely to the OT constraint known as MORPH=σ. This stipulates a typological phonological proclivity to relate each morpheme to exactly one syllable. The success of the approach we propose, therefore, should reaffirm not only the utility of such an insight, but its role in cur-
Similar resources
Turkish PoS Tagging by Reducing Sparsity with Morpheme Tags in Small Datasets
Sparsity is one of the major problems in natural language processing. The problem becomes even more severe in agglutinating languages that are highly prone to be inflected. We deal with sparsity in Turkish by adopting morphological features for part-of-speech tagging. We learn inflectional and derivational morpheme tags in Turkish by using conditional random fields (CRF) and we employ the morph...
The SED heuristic for morpheme discovery: a look at Swahili
This paper describes a heuristic for morpheme- and morphology-learning based on string edit distance. Experiments with a 7,000-word corpus of Swahili, a language with a rich morphology, support the effectiveness of this approach.
Refining The SED Heuristic For Morpheme Discovery: Another Look At Swahili
This paper describes a heuristic for morpheme- and morphology-learning based on string edit distance. Experiments with a 7,000-word corpus of Swahili, a language with a rich morphology, support the effectiveness of this approach.
Generalized unknown morpheme guessing for hybrid POS tagging of Korean
Most errors in Korean morphological analysis and POS (part-of-speech) tagging are caused by unknown morphemes. This paper presents a generalized unknown-morpheme handling method with POSTAG (POStech TAGger), a hybrid statistical/rule-based POS tagging system. The generalized unknown morpheme guessing is based on a combination of a morpheme pattern dictionary which encodes general le...
Uzbek-English and Turkish-English Morpheme Alignment Corpora
Morphologically-rich languages pose problems for machine translation (MT) systems, including word-alignment errors, data sparsity and multiple affixes. Current alignment models at word-level do not distinguish words and morphemes, thus yielding low-quality alignment and subsequently affecting end translation quality. Models using morpheme-level alignment can reduce the vocabulary size of morpho...